Extracting Multiword Terms from Document Collections

Authors

  • Joaquim Ferreira da Silva
  • Gabriel Pereira Lopes
Abstract

Multiword terms (MWTs) are relevant strings of words in text collections. Once automatically extracted, they can be used by an Information Retrieval system to suggest to its users conceptually interesting refinements of their information needs. These multiword terms point to relevant information, often corresponding to topics and subtopics in the text collection, and may be quite useful, especially for refining highly generic queries. In this paper, we introduce the LocalMaxs algorithm for automatically extracting multiword terms. This algorithm requires neither empirically suggested thresholds, nor complex linguistic filters, nor language-specific morpho-syntactic rules. These features make it a suitable approach for extracting MWTs from text collections written in any language. Moreover, by introducing the Fair Dispersion Point Normalization concept, we can deal with arbitrarily long MWTs and compare the results obtained by using different word association measures for MWT selection. We also introduce our own association measure, the SCP, to work with the LocalMaxs algorithm, and assess its results by comparing it with related statistics-based measures (Specific Mutual Information, Dice, Loglike and φ² coefficients) in experiments on a text collection. An Information Retrieval application using our approach is also presented.
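The abstract names the SCP measure, the Fair Dispersion Point Normalization and the LocalMaxs selection rule without giving formulas. The sketch below is one possible reading, not the authors' implementation. It assumes that the normalized SCP of an n-gram is its squared probability divided by the average, over all n-1 split points, of the product of the probabilities of the left and right parts, and that an n-gram is selected when its score is at least that of the (n-1)-grams it contains and strictly greater than that of the (n+1)-grams containing it. The function names, the `min_count` filter and the use of the token count as the probability denominator are all hypothetical choices made for illustration.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every contiguous n-gram of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def scp_f(ngram, counts, total):
    """Assumed normalized SCP: p(ngram)^2 / average of p(left) * p(right)."""
    def prob(gram):
        return counts[gram] / total

    n = len(ngram)
    # Average the product of probabilities over every split point (assumption).
    avp = sum(prob(ngram[:i]) * prob(ngram[i:]) for i in range(1, n)) / (n - 1)
    return prob(ngram) ** 2 / avp if avp else 0.0


def local_maxs(counts, total, max_n, min_count=2):
    """Return n-grams (2 <= n < max_n) whose score is a local maximum."""
    glue = {g: scp_f(g, counts, total) for g in counts if len(g) >= 2}
    terms = []
    for gram, value in glue.items():
        n = len(gram)
        # Skip rare n-grams and those with no (n+1)-grams to compare against.
        if n >= max_n or counts[gram] < min_count:
            continue
        # Scores of the two (n-1)-grams contained in gram (none for bigrams).
        subs = [glue[s] for s in (gram[:-1], gram[1:]) if len(s) >= 2]
        # Scores of the (n+1)-grams that extend gram by one word on either side.
        supers = [v for h, v in glue.items()
                  if len(h) == n + 1 and (h[:-1] == gram or h[1:] == gram)]
        if all(value >= s for s in subs) and all(value > s for s in supers):
            terms.append(gram)
    return terms
```

Called as `local_maxs(ngram_counts(tokens, 4), len(tokens), max_n=4)` on a tokenized collection, this returns candidate terms of two or three words. No score threshold is passed in: selection depends only on comparisons between neighbouring n-gram lengths, which is the property the abstract emphasizes.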

Similar resources

Extracting Multiwords From Large Document Collection Based N-Gram

Multiword terms (MWTs) are relevant strings of words in text collections. Once automatically extracted, they can be used by an Information Retrieval system to suggest to its users conceptually interesting refinements of their information needs. These multiword terms point to relevant information, often corresponding to topics and subtopics in the text collecti...

Combining Linguistics with Statistics for Multiword Term Extraction: A Fruitful Association?

The acquisition of multiword terms from large text collections is a fundamental issue in the context of Information Retrieval. Indeed, their identification leads to improvements in the indexing process and allows guiding the user in his search for information. In this paper, we present an original methodology that allows extracting multiword terms by either (1) exclusively considering statistic...

Automatic Discovery And Aggregation Of Compound Names For The Use In Knowledge Representations

Automatic acquisition of information structures like Topic Maps or semantic networks from large document collections is an important issue in knowledge management. An inherent problem with automatic approaches is the treatment of multiword terms as single semantic entities. Taking company names as an example, we present a method for learning multiword terms from large text corpora expl...

Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations

Automatic acquisition of information structures like Topic Maps or semantic networks from large document collections is an important issue in knowledge management. An inherent problem with automatic approaches is the treatment of multiword terms as single semantic entities. Taking company names as an example, we present a method for learning multiword terms from large text corpora exploiting th...

Extracting Multiword Translations from Aligned Comparable Documents

Most previous attempts to identify translations of multiword expressions using comparable corpora relied on dictionaries of single words. The translation of a multiword was then constructed from the translations of its components. In contrast, in this work we try to determine the translation of a multiword unit by analyzing its contextual behaviour in aligned comparable documents, thereby not p...

Publication date: 1999